Bulletpapers - Understand complex papers in seconds

May 2024

Self-supervised learning optimizer for safe robot trajectory planning

This paper proposes a self-supervised learning optimizer called SOMTP to enable fast and efficient trajectory planning for robots while ensuring obstacle avoidance through control barrier functions (CBFs). It combines problem transcription, a differentiable correction procedure, and augmented Lagrangian-based training with guide policies to satisfy constraints and achie...

May 2024

Advancing driving perception technologies under challenging real-world conditions

The 2024 RoboDrive Challenge focused on developing robust perception systems for autonomous vehicles that can withstand diverse disturbances like weather changes and sensor failures. A total of 140 international teams participated. Key innovations emerged in data augmentation, sensor fusion, self-supervised learning, and new algorithms that handle sensor incons...

May 2024

Efficient foundation model adaptation for self-supervised endoscopic depth estimation

This paper proposes an efficient method called EndoDAC to adapt foundation models for self-supervised depth estimation in endoscopic video. It uses techniques like Dynamic Vector-Based Low-Rank Adaptation and Convolutional Neck blocks to minimize trainable parameters. It also estimates camera intrinsics for generalization. Experiments show EndoDAC outperforms prior arts...

May 2024

Robust lung tumor segmentation with self-supervised learning

This paper compares self-supervised learning approaches to train deep learning models for lung tumor segmentation. It finds that 'wild' pre-training with diverse medical images makes models more robust to differences in CT scan parameters than using curated task data. The Swin Transformer architecture benefits more than the Vision Transformer. Key factors are learning local ima...

May 2024

Leveraging audio to enhance robotic manipulation

This paper proposes using contact microphones as a tactile sensor on a robot arm to capture audio information during manipulation tasks. The key insight is that audio captures high-frequency vibrations indicating subtle interactions between surfaces. The authors leverage recent advances in large-scale self-supervised audio-visual learning to obtain an audio encoder capt...

May 2024

Self-supervised modeling for text recognition

This paper proposes Symmetric Superimposition Modeling (SSM), a self-supervised approach for text recognition that captures both character shapes and linguistic context by reconstructing original and inverted images from their superimposition. SSM operates at both pixel and feature levels. At the pixel level, it reconstructs original and inverted images to capture shape...

May 2024

Solar panel identification from aerial imagery

This paper introduces S3Former, a deep learning model to identify and segment solar panels in aerial images. It features a transformer architecture and self-supervised pre-training to address challenges like varying image conditions and lack of domain-specific training data. Key capabilities highlighted include accurately detecting small panels and distinguishing panels...

May 2024

Self-Supervised Wireless Localization

This paper proposes a novel self-supervised wireless localization technique that achieves high accuracy by using time-of-arrival measurements and known transmitter locations. It incorporates laser scanner depth data during training to further improve performance without needing labels. In simulations, the technique achieves sub-meter accuracy 90% of the time, outperform...

May 2024

Self-supervised image denoising with distorted inputs

This paper analyzes a self-supervised algorithm for image denoising that can handle distorted or 'denatured' inputs. Both theoretical analysis and experiments are used to evaluate performance. Key findings show the algorithm can find good solutions for population risk, but performance on test data depends on the difficulty of the distortions. Overall this suggests the a...

April 2024

Principal Mask Proposals for Unsupervised Semantic Segmentation

This paper presents a method called PriMaPs to decompose images into semantically meaningful masks using principal components of self-supervised image features. These masks serve as proposals to guide an expectation-maximization algorithm, PriMaPs-EM, to realize unsupervised semantic segmentation by fitting class prototypes. Despite simplicity, PriMaPs-EM leads to compe...

April 2024

Evaluating image quality by comparison to alternate views

This paper introduces CrossScore, a method to evaluate the quality of an image by comparing it to other images capturing the same scene from different viewpoints. It allows detailed analysis without requiring an aligned 'ground truth' reference image. A neural network leverages cross-attention to compare a query image against multiple reference images, predicting qualit...

April 2024

Self-Supervised Monocular Depth Estimation Using Day Images

This paper proposes a self-supervised framework to train a monocular depth estimation model for nighttime scenes using only daytime images. It applies physical models to compensate for key day-night differences in photometric and noise distributions. Results show state-of-the-art performance on night datasets without using any night images during training.

April 2024

Bias of Video Distance Metric Towards Frame Quality

This paper explores the bias in the commonly used Fréchet Video Distance (FVD) metric towards individual frame quality over temporal consistency. Through controlled experiments, it finds that FVD is not very sensitive to temporal distortions. It shows that sampling a subset of frozen, motionless videos can still reduce FVD scores. The bias likely comes from supervised features bia...

April 2024

Self-supervised segmentation of visual entities

This paper presents a novel computer vision approach called SOHES that can segment visual entities in images without needing manual image annotations. It works in three main phases: generating high-quality pseudo-labels from unlabeled images, training a model on those pseudo-labels, and refining the model's predictions. A key capability is segmenting not just whole obje...

April 2024

Self-supervised learning for dataset compression

This paper proposes a new framework called SC-DD that uses self-supervised learning to compress datasets more effectively. It trains the compression model using a self-supervised objective which results in more informative distributions of batch norm mean and variance statistics. This allows the model to recover more diverse information during data synthesis compared to...

April 2024

Depth estimation using water reflections

This paper proposes a self-supervised framework to estimate depth maps from single images of water scenes, using the water's reflections as additional viewpoints. A water segmentation network separates reflections, then a photometric loss matches real and reflected views to optimize depth and pose networks. SmoothL1 and a novel PASSIM loss emphasize structural similarit...

April 2024

Masked modeling for understanding motion in crowded scenes

This paper introduces Social-MAE, a transformer-based masked autoencoder framework that uses self-supervised pre-training on multi-person motion data to learn representations that effectively capture nuances of human interactions. A reconstruction task teaches the model to recreate masked joint trajectories. The pre-trained encoder gives state-of-the-art results when fi...

April 2024

Learning how everyday actions sound

This paper proposes a self-supervised model to learn associations between how everyday human actions look, sound, and are described in language. It uses a novel training approach with narrated egocentric video that handles ambiguity between modalities, outperforming prior methods at discovering subtle sounding actions and learning multimodal embeddings evaluated on two ...

April 2024

Self-tuning neural network detects time series anomalies

This paper proposes a self-tuning neural network called TSAP that can detect various types of anomalies in time series data without labeled data. TSAP has a differentiable augmentation module to create pseudo anomalies and an unsupervised loss to align augmentations with real anomalies. Experiments show TSAP can effectively select augmentations and outperforms baselines...

April 2024

Self-supervised heterogeneous graph learning

This paper proposes a novel self-supervised neural network model called GC-HGNN for learning useful representations on heterogeneous graphs, which have multiple node and edge types. It combines graph generative learning and contrastive learning techniques. A masked autoencoder generates augmented views for contrastive learning without altering the graph structure. Posit...

March 2024

Self-supervised image retrieval with open instructions

This paper introduces MagicLens, a series of self-supervised image retrieval models that can follow open-ended text instructions to find relevant images. The key insight is that image pairs naturally co-occurring on web pages contain diverse implicit relations beyond visual similarity. By using large language models to make those relations explicit as text instructions,...

March 2024

Self-supervised learning for noise-robust keyword spotting

This paper explores using self-supervised learning to improve the noise robustness of keyword spotting models. Models of 3 sizes are pretrained with Data2Vec on clean or noisy data, then fine-tuned. Results show pretraining helps even without noisy data, outperforming supervised models. Using noisy data for pretraining boosts robustness further. The Data2Vec-denoising a...

March 2024

Self-supervised indoor depth learning

This paper proposes a self-supervised framework called F2Depth to address the challenge of estimating depth from monocular images in indoor scenes. It introduces an optical flow network to supervise depth learning. A patch-based photometric loss focuses on discriminative features to improve optical flow accuracy. The finetuned flow network provides optical flow supervis...

March 2024

Self-supervised learning of cardiac MRI views

This paper proposes two complementary self-supervised pretext tasks to pretrain deep networks on cardiac MRI data with anatomy-oriented imaging planes: 1) regressing the relative orientations between standard cardiac views by predicting their intersecting lines, and 2) regressing the relative slice locations within a parallel stack. Experiments on multi-structural segme...

March 2024

Hierarchical text and image alignment for histopathology representation learning

This paper proposes a new self-supervised learning framework called HLSS that aligns hierarchical natural language descriptions with visual features in histopathology images across patient, slide, and patch levels. This helps the model learn improved representations that achieve state-of-the-art performance on downstream tasks and provide better interpretability.

March 2024

Impact of data diversity on self-supervised learning

This paper explores how training self-supervised learning models on more diverse datasets impacts performance. The key findings are that more diversity helps, but only when the distribution of data matches the end task. Large diversity from web data or AI-generated data still struggles to offset distribution differences. The experiments cover 7 methods over 200 GPU days.

March 2024

Wavelet Image Alignment for Self-Supervised Low-Dose CT Denoising

This paper proposes a self-supervised method to denoise low-dose CT images using only normal-dose CT data. It introduces two key ideas: aligning normal-dose and low-dose CT images in the frequency domain using wavelets, and a multi-scale loss function focused on high-frequency components. Experiments show state-of-the-art performance on public datasets compared to other s...

March 2024

Grouping Points for Semantic-Aware 3D Representation Learning

This paper proposes GroupContrast, a self-supervised framework to learn effective 3D representations by combining segment grouping and semantic-aware contrastive learning. Segment grouping partitions unlabeled point clouds into semantically coherent regions to provide guidance for subsequent contrastive learning. By constructing positive pairs within groups and negative...

March 2024

Robust medical imaging via counterfactual contrastive learning

This paper proposes a new self-supervised learning method, CF-SimCLR, that uses counterfactual image generation to create more realistic data augmentations for contrastive learning. It is evaluated on chest x-rays and mammograms across 5 datasets. Results show CF-SimCLR substantially improves model robustness to differences in image acquisition and generalizability to n...

March 2024

Improving force prediction with denoising non-equilibrium atomistic structures

This paper proposes a technique called DeNS to improve neural network performance at predicting forces and energies in atomistic systems. DeNS works by corrupting 3D atomic structures with noise, and then training models to denoise them by predicting that exact noise. This acts as an auxiliary self-supervised task. The key insight is encoding the forces of the original ...

March 2024

Masked modeling of multi-view video for autonomous driving

This paper proposes a novel pre-training approach called MIM4D that uses masked modeling on multi-view video inputs to learn visual representations for autonomous driving tasks. It models both spatial and temporal relations by reconstructing missing voxel features in 4D space. This allows capturing dynamic scene flow while also learning geometric structure through diffe...

March 2024

Enhancing object discovery through voting on self-supervised models

This paper introduces VoteCut and CuVLER, innovative methods that leverage multiple self-supervised vision transformer models to significantly improve unsupervised object discovery and segmentation. Key innovations include: graph partitioning and clustering algorithms to generate mask proposals; confidence scoring of proposals; refined training procedures. Evaluations a...

March 2024

Self-supervised learning model for assessing children's painting aesthetics

This paper proposes a self-supervised learning model to assess the aesthetic qualities of children's paintings from multiple perspectives, overcoming limitations of previous methods that rely on large labeled datasets. A novel dataset is constructed containing over 20,000 unlabeled generated paintings and 1,200 expert-labeled real paintings with 8 aesthetic attributes. ...

March 2024

Self-learning for night image visibility enhancement

This paper proposes a self-learning method to improve the visibility of nighttime images obscured by haze and other degradations. It uses a technique inspired by Masked Autoencoders (MAE) that intentionally degrades clear night images during training, forcing the model to learn robust representations that generalize to real-world night haze images. A key contribution is...

March 2024

Learning environment-specific communication for multi-agent teams

This paper introduces a method to learn a reusable, environment-specific communication strategy for multi-agent teams that works across tasks in the same environment. It is trained in a self-supervised, task-agnostic way using set autoencoders on agent observations, approximating the global state. This enables adapting to new tasks without retraining communication and s...

March 2024

Enhancing image recognition through integrating out-of-distribution detection into partial label ...

This paper introduces a new partial label learning (PLL) approach called PLL-OOD that integrates out-of-distribution (OOD) detection into the PLL framework for the first time. This allows the model to accurately classify in-distribution data while identifying and handling OOD data points that differ from the training distribution. The method improves feature representat...

March 2024

Detecting Car Damage and Aligning Images

This paper presents a Mask R-CNN model to accurately detect types of damage on cars from images. It also proposes a new self-supervised learning method to align pre-rental and post-rental car images and find differences between them, replacing traditional computer vision alignment techniques.

March 2024

Customized prototype learning for few-shot image segmentation

This paper proposes a new few-shot image segmentation method called Query-guided Prototype Evolution Network (QPENet). It creates customized foreground and background prototypes tailored to each query image's specific needs. This is done through an iterative process that evolves the prototypes using both support and query features. Key innovations are pseudo-prototype g...

March 2024

Self-supervised patient modeling from multi-sensor images

The paper proposes a modular 3D patient modeling method using multi-modal images like CT and MRI. It has an attentive fusion module to detect patient joints robustly from color and depth images. Also, a self-supervised mesh regressor is used to estimate the 3D body shape without needing expensive 3D shape annotations. Experiments show superior performance in clinical pa...

February 2024

Self-supervised multimodal recommendation

This paper proposes a self-supervised learning framework called MENTOR for multimodal recommendation. MENTOR introduces new techniques to align features from different modalities like text and images, while retaining user preference information. It also enhances the model's robustness. Experiments show MENTOR improves accuracy over existing methods.

February 2024

Removing distortion in videos using deep learning

This paper proposes a deep learning method to remove atmospheric distortion in videos. It is self-supervised, so it trains on the input video itself without requiring other training data. This allows it to work well on any specific video content. It builds on Deep Image Prior techniques but adds temporal processing using pixel shuffling over frames in a sliding window. ...

February 2024

Self-supervised multitask learning for review helpfulness prediction

This paper proposes a self-supervised multitask learning approach to predict the helpfulness of multimodal reviews. It generates pseudo-labels to train subtasks that capture consistency and differentiation between text and images. Experiments show it outperforms previous textual and multimodal methods on two e-commerce datasets.

February 2024

Self-supervised learning for pediatric tuberculosis detection

This paper proposes a self-supervised learning approach using Vision Transformers to improve tuberculosis detection in chest X-rays. When pre-trained on adult data and tested on pediatric data, this method achieved top performance of 0.697 AUC, demonstrating effective knowledge transfer to pediatric tuberculosis where data is scarce. The approach shows promise for impro...

February 2024

Overcoming collapse in self-supervised medical image learning

This paper investigates issues applying contrastive self-supervised learning to medical images. It finds that the high inter-image similarity causes 'dimensional collapse', limiting feature richness. To address this, two techniques are proposed: 1) local feature learning, to focus on distinguishing local regions rather than global features, and 2) feature decorrelation,...

February 2024

Enhancing code models with comment augmentation

This paper examines the impact of aligning code with natural language comments in pre-training data for code-focused language models. To address the scarcity of aligned data, the authors introduce a method to generate comments for existing code using a model tuned for comment generation. This augmented data, coupled with filtering strategies, is used to further train models an...

February 2024

Self-supervised graph learning for repeat detection

This paper introduces GraSSRep, a new approach using graph neural networks and self-supervised learning to accurately detect repetitive DNA sequences in metagenomic data. It frames repeat detection as a node classification problem on an assembly graph. High-precision heuristics generate noisy labels for some nodes, then a graph neural network and random forest classifie...

February 2024

Self-supervised object detection and part-whole understanding

This paper proposes a novel approach called HASSOD that advances self-supervised object detection by grouping image regions into masks representing object boundaries. It also identifies hierarchical levels of objects as wholes, parts, or subparts by analyzing coverage relationships between masks. Key benefits are improved detection performance, enhanced interpretability...

January 2024

Detecting folding patterns in brain cortex

This paper trains a deep learning model with SimCLR, a self-supervised contrastive learning framework, to detect folding patterns in the cingulate region of the brain cortex, using MRI scans from over 20,000 subjects. The model is optimized and evaluated by its ability to detect a specific "double parallel" folding pattern related to schizophrenia. The best model uses a convolutional neural network bac...

January 2024

Self-supervised learning of nerve fiber patterns

This paper proposes a new method to characterize nerve fiber architecture in microscopic brain images, using self-supervised representation learning. A 3D-Context Contrastive Learning approach is introduced that samples similar fiber architecture examples across nearby sections of a 3D reconstructed brain volume as positive pairs to train an encoding model. This allows ...

January 2024

Deconstructing diffusion models for self-supervised image representation learning

This paper deconstructs modern Denoising Diffusion Models (DDMs), which are powerful generative models, to understand their ability for self-supervised representation learning. Through step-by-step simplification, the authors push DDM towards a classical Denoising Autoencoder (DAE), finding that only a few key components are critical. Ultimately they arrive at a simplif...

January 2024

Parameter-efficient fine-tuning advances medical imaging models

This paper explores using parameter-efficient fine-tuning (PEFT) methods like LoRA on self-supervised chest X-ray foundation models, showing PEFT can match or exceed full fine-tuning performance while using under 1% of parameters. On small datasets, PEFT advanced state-of-the-art. The authors hope this spurs more PEFT research for medical imaging.
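
To make the "under 1% of parameters" figure concrete, the following is a minimal sketch of how LoRA's trainable-parameter count compares with full fine-tuning for a single weight matrix. The layer size (768 x 768) and the ranks are illustrative assumptions, not values taken from the paper.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of a d_out x d_in weight matrix's parameters that LoRA trains.

    LoRA freezes the original matrix W and learns a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in.
    """
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)
    return lora_params / full_params


# Hypothetical layer size and ranks, chosen only for illustration.
for rank in (1, 2, 4, 8):
    frac = lora_trainable_fraction(768, 768, rank)
    print(f"rank={rank}: {frac:.2%} of the layer's parameters are trainable")
```

The fraction for a whole model is usually lower still, because adapters are typically attached to only some layers while everything else stays frozen.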

January 2024

Predicting open-vocabulary 3D scene occupancy from images

This paper proposes an approach to predict a 3D voxel occupancy map from 2D images, enabling open-vocabulary tasks like zero-shot segmentation and language-based retrieval. A model is designed with 2D-3D encoding, occupancy prediction, and 3D-language heads. It's trained via self-supervision using images, language features, and LiDAR, not needing manual 3D annotations. ...

January 2024

Self-emerging token labeling for vision transformers

This paper proposes a self-emerging token labeling framework to improve the pre-training of vision transformers. It contains two stages: first training a vision transformer token labeler to generate semantic token labels, then training a student model using both original labels and self-emerging token labels. The best model achieves state-of-the-art accuracy on ImageNe...

January 2024

Semi-supervised image classification with self-supervision

This paper explores a semi-supervised learning framework called Color-S4L that integrates self-supervised pretext tasks like image colorization to improve image classification with limited labels. It shows competitive performance on CIFAR and SVHN datasets compared to prior semi-supervised methods.

January 2024

Self-supervised learning for speech recognition using task-oriented dialogues

This paper introduces a family of contrastive self-supervised learning methods called CLC that leverage artifacts from unsuccessful task-oriented dialogues to improve automated speech recognition models. The key ideas are: (1) maximize agreement between current, past and future utterance embeddings within a dialogue while minimizing agreement between dialogues, and (2) ...

December 2023

Unconstrained 3D Reconstruction Made Easy

This paper introduces DUSt3R, a novel approach to dense 3D reconstruction that operates without prior information about camera calibration or viewpoint poses. It formulates the problem as regression of 'pointmaps', which encode both scene geometry and pixel-to-3D correspondences. This formulation unifies monocular and binocular reconstruction cases. For sets of images, ...

December 2023

Audio-based active speaker detection and localization

This paper proposes an audio-only neural network that can simultaneously detect and locate active speakers in video frames using input from a microphone array. It is trained using a self-supervised student-teacher approach, where an existing audio-visual detector provides supervision. At test time, it can locate speakers even when faces are occluded. Experiments showed ...

December 2023

Stronger segmentation with descriptive properties

This paper proposes using descriptive properties from large language models, instead of one-hot category labels, to supervise segmentation models. Properties are clustered into an interpretable label space. This enhances model performance, scalability and generalization ability.

December 2023

Learning by predicting hard patches to mask

This paper proposes a new pretraining approach called Hard Patches Mining (HPM) which makes models predict which image patches will be hard to reconstruct, and then masks those patches as a challenging pretext task. This allows models to generate their own difficult problems rather than just solve given problems. Experiments show HPM brings significant gains.
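
The masking step described above can be sketched as follows. This is only an illustration of selecting patches by predicted difficulty: the function name, the top-k selection rule, and the mask ratio are assumptions, and the paper's actual procedure may differ (for example, in how it mixes hard and random masking).

```python
def hard_patch_mask(predicted_losses, mask_ratio=0.75):
    """Return a list of booleans, True where a patch should be masked.

    predicted_losses holds one predicted reconstruction loss per patch;
    the patches with the highest predicted loss are masked first.
    """
    num_patches = len(predicted_losses)
    num_masked = int(num_patches * mask_ratio)
    # Rank patch indices from hardest (highest predicted loss) to easiest.
    ranked = sorted(range(num_patches),
                    key=lambda i: predicted_losses[i],
                    reverse=True)
    masked = set(ranked[:num_masked])
    return [i in masked for i in range(num_patches)]


# Hypothetical predicted losses for four patches.
print(hard_patch_mask([0.1, 0.9, 0.4, 0.7], mask_ratio=0.5))
# prints [False, True, False, True]
```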

December 2023

Self-supervised multi-camera 3D scene reconstruction

This paper proposes OccNeRF, a method to predict 3D occupancy and geometry from multi-camera images in a self-supervised fashion, without ground truth 3D or 2D labels. It handles unbounded scenes by parameterizing occupancy fields. Multi-frame photometric consistency provides supervision, aided by an open-vocabulary segmentation model for semantics.

December 2023

Weakly Supervised 3D Object Detection with Visual Guidance

This paper proposes a framework to train a 3D object detector using only 2D image annotations as supervision. It establishes connections between the 2D and 3D data from three perspectives: aligning image and LiDAR features based on object regions (feature-level), enforcing overlap between 2D boxes and projected 3D boxes (output-level), and generating consistent 2D and 3...

December 2023

Self-supervised patch classification

This paper introduces NearbyPatchCL, a self-supervised learning method that leverages nearby patches in whole-slide images as positive samples to learn robust representations for patch-level multi-class classification. A new benchmark dataset called P-CATCH is curated from canine histology images for evaluation. Experiments show NearbyPatchCL significantly outperforms s...

December 2023

Self-guided semantic image segmentation

The authors propose a novel framework called Self-Seg that can automatically detect and segment objects in images without needing any textual input specifying class names. Self-Seg uses a vision-language model called BLIP to cluster image regions and generate captions describing each cluster. These captions are filtered to extract noun class names, which guide a semanti...

December 2023

Estimating road scene depth from camera height consistency

This paper introduces a self-supervised method to estimate metric depth from monocular videos, without needing ground truth depth or scale supervision. It works by enforcing consistency of the estimated camera height across video frames, aggregating object size cues. This achieves state-of-the-art accuracy compared to related weakly-supervised methods.

December 2023

Preventing overfitting in Barlow Twins with mixed samples

The Barlow Twins algorithm for self-supervised learning can overfit when embedding dimensions get large, degrading representation quality over time. This paper introduces a mixed sample regularization technique that improves sample interaction in Barlow Twins training. Assuming linear interpolations in input space lead to linear interpolations in embedding space, they f...

December 2023

Self-Supervised Learning for Functional Connectivity Networks

This paper proposes a self-supervised learning framework tailored for functional connectivity networks from fMRI data. It introduces Spatio-Temporal Masked Autoencoder (ST-MAE) to effectively capture both spatial graph structure and temporal dynamics. Pre-trained on a large UK Biobank fMRI dataset, it demonstrates superior performance over baselines in downstream tasks ...

November 2023

Pose estimation for unseen objects

This paper proposes FoundPose, a method to estimate the 6D pose (3D rotation and translation) of previously unseen rigid objects from a single RGB image, using the object's 3D model but without any training data. It builds on top of DINOv2, a state-of-the-art self-supervised vision model, to match image patches between the query image and synthetically rendered views of...

November 2023

Lightweight clustering for semantic segmentation

This paper proposes a lightweight clustering framework to perform semantic segmentation without labels. It utilizes attention features from self-supervised vision transformers, which have strong foreground/background differences. These features are clustered into groups at the dataset, category, and image levels. Consistency across levels extracts high-quality binary ps...

November 2023

Iterative pixel grouping for self-supervised visual representation learning

This paper proposes Perceptual Group Tokenizer (PGT), a self-supervised visual representation learning model that relies entirely on iterative perceptual grouping operations to extract features. PGT iteratively groups pixels into contextual tokens, refining representations over multiple rounds. It achieves 80.3% top-1 accuracy on ImageNet, competitive with state-of-the-...

November 2023

Diffusion models for visual representation learning

This paper introduces SODA, a self-supervised diffusion model for both robust visual representation learning and high-quality image generation. By adding an image encoder and imposing a compact latent bottleneck between it and the diffusion decoder, SODA is able to capture semantic information that aids downstream classification tasks. When trained for novel view synthe...

November 2023

Self-supervised video object segmentation via attention

This paper proposes a self-supervised approach for video object segmentation that leverages the structural dependencies and emerging objectness present in DINO-pretrained vision transformers. It introduces a simplified architecture with a single spatio-temporal transformer block on top of DINO features to establish robust correspondences across frames in the form of att...

November 2023

Improving reasoning through spatial modeling

This paper proposes a method to model spatial relationships between objects in an image to improve visual reasoning. The authors construct a spatial relation graph and introduce two pretraining tasks, object position regression and spatial relation classification, to reconstruct graph properties. When incorporated into model pretraining, this approach strengthens unders...

November 2023

Self-supervised learning is more efficient than foundation models for cardiac ultrasound segmenta...

This paper compares a foundation model (SAM) to a self-supervised learning model for segmenting cardiac chambers in ultrasound images. The self-supervised model achieves better performance while requiring far fewer computational resources and no manual labeling, demonstrating greater efficiency than the foundation model.

November 2023

Self-supervised visual relationship learning

This paper proposes a novel self-supervised approach to learn visual relationship representations without manual annotations. The key idea is Masked Bounding Box Reconstruction (MBBR), where object features in a scene are randomly masked and must be reconstructed from unmasked context. This forces the model to learn the interactions between objects. The method is shown ...
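
The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the zero-vector mask token, the 50% mask ratio, and the mean-squared-error objective are all assumptions made for the example.

```python
import numpy as np


def mask_object_features(features, mask_ratio=0.5, rng=None):
    """Zero out a random subset of per-object feature rows.

    features: array of shape (num_objects, feature_dim).
    Returns the masked copy and a boolean mask (True = masked object).
    The zero vector as mask token and the default ratio are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_objects = features.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_objects)))
    masked_idx = rng.choice(num_objects, size=num_masked, replace=False)
    mask = np.zeros(num_objects, dtype=bool)
    mask[masked_idx] = True
    masked = features.copy()
    masked[mask] = 0.0
    return masked, mask


def reconstruction_loss(predicted, target, mask):
    """Mean squared error computed only over the masked objects."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))
```

A model trained this way would receive `masked` as input and be scored with `reconstruction_loss` against the original features, so only the hidden objects contribute to the gradient.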

November 2023

Using lidar data to train image segmentation models

This paper proposes a novel approach that leverages lidar point cloud annotations to train image segmentation models, avoiding the need for labor-intensive image annotation. Lidar provides sparse 3D points that are projected onto images as labels for training. A masked loss function enables models to learn from sparse lidar-derived labels. Experiments show comparable pe...
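
One way to picture a masked loss over sparse labels is a cross-entropy that skips every pixel without a projected lidar point. The sketch below assumes a hypothetical `IGNORE_LABEL` marker of -1 for unlabeled pixels and a plain softmax cross-entropy; the paper's actual loss may differ.

```python
import numpy as np

IGNORE_LABEL = -1  # hypothetical marker for pixels with no projected lidar point


def masked_cross_entropy(logits, labels, ignore_label=IGNORE_LABEL):
    """Cross-entropy averaged over labeled pixels only.

    logits: float array of shape (H, W, num_classes).
    labels: int array of shape (H, W); ignore_label marks unlabeled pixels.
    """
    valid = labels != ignore_label
    if not valid.any():
        return 0.0
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class at each labeled pixel.
    picked = log_probs[valid, labels[valid]]
    return float(-picked.mean())
```

Because unlabeled pixels are excluded before averaging, predictions at those locations receive no gradient from this term.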

November 2023

Self-supervised learning for target speech extraction

This paper proposes a self-supervised learning method called SHuBERT to extract representations of a target speaker's speech from clean or mixture signals. It uses an enrolled utterance from the target speaker to guide selective attention, and a dual-path contrastive learning strategy for noise robustness.

November 2023

Learning point cloud representations with images

This paper proposes a new method called PRED for pre-training point cloud encoders using multi-view images. It addresses challenges like point cloud incompleteness and occlusion when aligning images and point clouds. The key idea is to render semantic maps from images conditioned on a point cloud feature map, providing supervision through neural rendering. This allows m...

November 2023

Automated detection of social behavior patterns

This paper introduces LISBET, a machine learning model that analyzes videos of mouse social interactions. LISBET detects patterns of social behavior without needing human labeling. The self-supervised model works by looking for important features like body part positions and movements over time. LISBET can identify frequent behaviors and transitions between them. It ali...

November 2023

Using CLIP for sound localization

This paper proposes using the pre-trained CLIP model for sound source localization, without explicit text queries. An audio tokenizer module translates audio into tokens compatible with CLIP's text encoder, to generate audio-driven embeddings. These embeddings localize sounding regions in images via an audio-visual grounder. Audio-grounded visual features are extracted ...

November 2023

Local image transformations with random fields for self-supervised learning

This paper proposes using Gaussian random fields to generate diverse local image transformations for self-supervised representation learning. The random field transformations generalize standard affine and color augmentations by allowing transformation parameters to vary spatially. Experiments show these flexible transformations improve accuracy, but hyperparameters mus...
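
As a rough illustration of a spatially varying augmentation parameter, the sketch below draws a smooth Gaussian random field by low-pass filtering white noise in the Fourier domain and uses it as a per-pixel brightness gain. The filter shape, `length_scale`, `strength`, and the choice of brightness as the transformed parameter are assumptions for this example, not details taken from the paper.

```python
import numpy as np


def gaussian_random_field(height, width, length_scale=8.0, rng=None):
    """Smooth zero-mean, unit-variance random field of shape (height, width)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((height, width))
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    # Gaussian low-pass filter: a larger length_scale gives a smoother field.
    low_pass = np.exp(-2.0 * (np.pi * length_scale) ** 2 * (fx ** 2 + fy ** 2))
    field = np.fft.ifft2(np.fft.fft2(noise) * low_pass).real
    field -= field.mean()
    std = field.std()
    return field / std if std > 0 else field


def spatially_varying_brightness(image, strength=0.2, length_scale=8.0, rng=None):
    """Scale each pixel of a float image in [0, 1] by a gain that varies smoothly."""
    field = gaussian_random_field(image.shape[0], image.shape[1], length_scale, rng)
    gain = 1.0 + strength * field
    if image.ndim == 3:
        gain = gain[:, :, None]  # share the same gain across channels
    return np.clip(image * gain, 0.0, 1.0)
```

Setting `length_scale` very large makes the gain nearly constant across the image, which approaches an ordinary global brightness jitter.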

November 2023

Learning to reuse manipulation strategies

This paper presents a framework that enables robots to learn reusable manipulation strategies ('mechanisms') from a single demonstration and subsequent self-play. The key idea is representing mechanisms as sequences of contact mode changes between robot and objects. The system extracts this sequence from a demonstration, then trains a specialized sampler via self-play. ...

November 2023

Efficient self-supervised sentence embedding

This paper proposes a new framework for learning sentence embeddings without human-labeled data. It uses a novel cross-view training approach to create robust sentence representations, especially for smaller pretrained language models.

November 2023

Low-light image enhancement using Retinex decomposition

This paper proposes a new method for enhancing low-light images called ZERRINNet. It decomposes images into reflection, illumination and noise components using deep learning. This allows it to address multiple issues like noise, color distortion and contrast simultaneously. It does not require any training data, instead enhancing images directly. Experiments showed it o...

November 2023

Learning user preferences from behavior sequences

This paper proposes a framework to learn user preferences from sequences of user behavior over time. The framework uses graph learning techniques and self-supervised learning to capture both local transitions within sequences and global correlations across sequences. This allows it to learn more robust user representations.

November 2023

Multi-channel molecular representation learning

This paper introduces a new framework for learning molecular representations that captures different structural perspectives via multiple self-supervised learning channels. It leverages knowledge of hierarchies within molecules by having channels focus on global, partial, and local views. The global view contrasts entire molecules, the partial view highlights scaffolds,...

November 2023

Self-attention versus convolution in audio models

This paper investigates using different types of encoders in self-supervised audio models. It shows simple transformers with only self-attention can achieve comparable efficiency to models mixing self-attention and convolution (like Conformer). This is especially true with low-bit quantization, since quantization errors don't accumulate across different modules. Overall...

November 2023

Self-supervised learning of periodic videos

This paper proposes CycleCL, a self-supervised learning method to extract useful features from periodic videos. It trains a model by sampling pairs of frames showing the same cycle phase, and pairs showing different phases. This teaches the model to produce features sensitive to phase progression, but invariant to repetition.
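
The pair sampling described above can be sketched in a few lines. For simplicity this example assumes the period (in frames) is already known and constant, which the summary does not state; `phase_of` and `sample_pairs` are hypothetical helper names.

```python
import random


def phase_of(frame_index, period):
    """Position of a frame within its cycle, assuming a fixed, known period."""
    return frame_index % period


def sample_pairs(num_frames, period, rng=None):
    """Return (positive_pair, negative_pair) of frame indices.

    The positive pair shares a cycle phase (but uses two different frames);
    the negative pair shows different phases.
    """
    rng = rng or random.Random()
    if period < 2 or num_frames < 2 * period:
        raise ValueError("need a period of at least 2 frames and two full cycles")
    anchor = rng.randrange(num_frames)
    anchor_phase = phase_of(anchor, period)
    same_phase = [i for i in range(num_frames)
                  if i != anchor and phase_of(i, period) == anchor_phase]
    other_phase = [i for i in range(num_frames)
                   if phase_of(i, period) != anchor_phase]
    return (anchor, rng.choice(same_phase)), (anchor, rng.choice(other_phase))
```

A contrastive loss would then pull the features of the positive pair together and push the negative pair apart, which yields features that track phase but ignore which repetition a frame came from.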

November 2023

Self-supervised learning for class imbalance in graphs

This paper proposes VIGraph, a self-supervised learning method to address class imbalance for node classification in graphs. It uses a variational graph autoencoder with careful training strategies to generate high-quality synthetic minority class nodes. This avoids limitations of prior SMOTE-based approaches that lack rigor in graph construction and node generation. Ex...

November 2023

Self-supervised terrain learning for efficient building detection

This paper proposes a self-supervised approach to learn useful features from unlabeled LiDAR data, specifically for building detection. By training a model to differentiate terrain from above-ground structures, it can extract useful patterns without manual labels. When transferred to building segmentation, this method shows improved performance and label efficiency over...

November 2023

Self-supervised video representation learning using masked reconstruction

This paper proposes a new method called Concatenated Masked Autoencoders (CatMAE) for self-supervised video representation learning. It masks the majority of patches in every video frame after the first and trains the model to reconstruct the original frames. This forces the model to understand motion and correspondences between frames in order to reconstruct...
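
The masking pattern described above (first frame fully visible, later frames mostly hidden) could be generated as in the sketch below. The 90% mask ratio and the function name are assumptions for illustration; the summary only says "the majority of patches".

```python
import numpy as np


def first_frame_visible_mask(num_frames, patches_per_frame, mask_ratio=0.9, rng=None):
    """Boolean mask of shape (num_frames, patches_per_frame); True = hidden patch.

    Frame 0 is left fully visible; every later frame has a random
    mask_ratio fraction of its patches hidden (the ratio is an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(mask_ratio * patches_per_frame))
    mask = np.zeros((num_frames, patches_per_frame), dtype=bool)
    for t in range(1, num_frames):
        hidden = rng.choice(patches_per_frame, size=num_masked, replace=False)
        mask[t, hidden] = True
    return mask
```

An encoder would see only the patches where the mask is False, and the reconstruction loss would be evaluated on the hidden patches of the later frames.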

November 2023

Self-supervised learning improves counterfactual estimation

This paper proposes a new method called COSTAR that uses self-supervised learning to improve counterfactual outcome estimation from time series data. The model is tailored for temporal data and handles challenges like long-range dependencies. It demonstrates superior accuracy and cross-domain generalization.

November 2023

Learning clinical feature embeddings

This paper explores self-supervised training methods to learn universal embeddings for clinical measurements like heart rate and blood pressure. The methods use language model objectives like continuous bag of words and masked language modeling. The clinical embeddings show interpretable structure and relationships when visualized. The embeddings also improve performanc...

November 2023

Learning visual representations from aligned radar and optical satellite data

This paper presents a framework called CROMA that learns useful visual representations from aligned radar and optical satellite imagery. CROMA combines contrastive and reconstruction self-supervised objectives to produce sensor-invariant and optionally multimodal representations. It introduces relative position encoding strategies that enable extrapolation to larger ima...

November 2023

Self-supervised learning with mixed nearest neighbors

This paper proposes a self-supervised learning method called MNN that enhances model performance by incorporating nearest neighbors as additional positive samples. It handles false neighbors through a weighting strategy and image mixing in feature space. MNN achieves strong results on image classification benchmarks while adding minimal overhead.

November 2023

Learning representations for time series by reconstructing motifs

This paper introduces a new method called REtrieval-BAsed Reconstruction (REBAR) for learning representations of time series in a self-supervised, contrastive way. It focuses on identifying similar motifs across subsequences. If one subsequence can reconstruct missing motifs in another subsequence, it suggests they share semantics and should be pulled together in the em...

November 2023

Analysis of speech model representations for emotion recognition

This paper analyzes several signal-based and neural network-based speech features on the task of emotion recognition, using six standard datasets. The features come from Mel spectrograms, MFCCs, speaker embedding networks, and self-supervised models. Results with simple classifiers show that certain model representations encode inherent information about emotions with...

November 2023

Mining samples to improve contrastive visual representation learning

This paper proposes a new approach to selecting positive and negative samples for contrastive self-supervised learning. It argues that common practices like using only augmented views as positives or random negatives can be limiting. Instead, it carefully mines additional positive and negative samples in a principled way. For positives, it combines augmented views with ...

October 2023

Improving large language model data generation

This paper proposes methods to make large language models better at generating high-quality training data for downstream NLP tasks. The key ideas are using a unified data creation pipeline requiring only a single formatting example, and introducing 'self-reference' strategies to iteratively sample from newly created data to diversify prompts.

October 2023

Learning visual representations without negative sample contrast

This paper proposes a new self-supervised learning approach called tri-factor contrastive learning (triCL). It replaces the traditional 2-factor contrast used in methods like SimCLR with a 3-factor form that incorporates a learnable diagonal importance matrix S. This enables exact feature identifiability and discovers feature importance, with theoretical guarantees. Emp...
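
To make the difference between the 2-factor and 3-factor forms concrete, the sketch below computes pairwise similarities as z_a[i]ᵀ S z_b[j], with S stored as a vector holding its diagonal. The function name and the way S is parameterized are assumptions; the training objective built on these similarities is not shown because the summary does not specify it.

```python
import numpy as np


def tri_factor_similarity(z_a, z_b, importance):
    """Pairwise similarities z_a[i]^T S z_b[j] with S = diag(importance).

    z_a, z_b: embeddings of two augmented views, shape (batch, dim).
    importance: vector of shape (dim,), the diagonal of S (learnable in training).
    Returns an array of shape (batch, batch).
    """
    # Scaling the columns of z_a by the diagonal is equivalent to z_a @ diag(s).
    return (z_a * importance) @ z_b.T
```

With `importance` set to all ones this reduces to the usual 2-factor dot-product similarity, and larger entries mark feature dimensions that contribute more to the contrast.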

October 2023

Learning robust visual features from imbalanced data

This paper proposes a method to learn robust visual features from imbalanced datasets without labels. It points out issues with standard contrastive self-supervised learning, which can lead to poor feature quality for rare classes. The method uses a geometric harmonization technique to balance the feature space across classes. Key aspects are measuring feature statistic...

The history of self-supervised learning